Autotuning Divide-and-Conquer Matrix-Vector Multiplication

Not recorded

Abstract

Divide and conquer is an important concept in computer science, used ubiquitously to simplify and speed up programs. To achieve the best performance, however, a divide-and-conquer program must be optimized, for example with respect to its parameter settings. The problem boils down to searching for the best implementation choice under a given set of requirements, such as the machine on which the program runs. The goal of this thesis is to apply and evaluate the Ztune approach [14] on serial divide-and-conquer matrix-vector multiplication. We implemented Ztune to autotune serial divide-and-conquer matrix-vector multiplication on machines with different hardware configurations, and found that Ztune-optimized codes ran 1%-5% faster than their hand-optimized counterparts. We also compared Ztune-optimized results with other matrix-vector multiplication libraries, including the Intel Math Kernel Library and OpenBLAS. Because matrix-vector multiplication is a level 2 BLAS operation, it is not as computationally intensive as problems such as matrix-matrix multiplication (a level 3 BLAS operation) and stencil computation. As a result, measurements of matrix-vector multiplication are more prone to error from factors such as noise, cache alignment of the matrix, and cache state, and these errors lead Ztune to make wrong decisions. We explored multiple options for obtaining more accurate measurements and demonstrated the techniques that remedied these issues. Lastly, we applied the Ztune approach to matrix-matrix multiplication and achieved a 2%-85% speedup compared to the hand-tuned code.

This thesis represents joint work with Ekanathan Palamadai Natarajan.

Thesis Supervisor: Professor Charles E. Leiserson
Title: Edwin Sibley Webster Professor in Electrical Engineering and Computer Science

This research was supported in part by NSF Grants 1314547 and 1533644.
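To make the tuning problem in the abstract concrete, here is a minimal Python sketch of a serial divide-and-conquer matrix-vector product with one tunable parameter. It is not the Ztune algorithm and not the thesis code: the names `dc_matvec`, `matvec`, and `tune_base`, the split rule (halve the larger dimension), and the single base-case threshold `base` are all assumptions made for illustration. The toy tuner simply times each candidate threshold and keeps the fastest one. Taking the minimum over a few repeats is one common way to dampen timing noise, and is likewise an assumption rather than the technique the thesis reports.

```python
import time


def dc_matvec(A, x, y, r0, r1, c0, c1, base):
    """Accumulate y[r0:r1] += A[r0:r1][c0:c1] * x[c0:c1] by recursive halving."""
    rows, cols = r1 - r0, c1 - c0
    # Base case: the block is small enough (or cannot be split further).
    if rows * cols <= base or (rows <= 1 and cols <= 1):
        for i in range(r0, r1):
            row = A[i]
            acc = 0.0
            for j in range(c0, c1):
                acc += row[j] * x[j]
            y[i] += acc
        return
    if rows >= cols:
        # Split the row range: the two halves write disjoint parts of y.
        mid = r0 + rows // 2
        dc_matvec(A, x, y, r0, mid, c0, c1, base)
        dc_matvec(A, x, y, mid, r1, c0, c1, base)
    else:
        # Split the column range: both halves accumulate into the same y rows.
        mid = c0 + cols // 2
        dc_matvec(A, x, y, r0, r1, c0, mid, base)
        dc_matvec(A, x, y, r0, r1, mid, c1, base)


def matvec(A, x, base):
    """Return A * x for a list-of-lists matrix A and a list vector x."""
    y = [0.0] * len(A)
    if A:
        dc_matvec(A, x, y, 0, len(A), 0, len(A[0]), base)
    return y


def tune_base(A, x, candidates, repeats=3):
    """Hypothetical toy tuner: return the candidate threshold with the lowest time."""
    best_base, best_time = None, float("inf")
    for base in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matvec(A, x, base)
            # Keep the minimum over repeats to reduce the effect of noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_base, best_time = base, elapsed
    return best_base
```

Calling `tune_base(A, x, [16, 64, 256, 1024])` on a representative matrix returns whichever threshold ran fastest on the current machine. The result can differ between machines and between runs, which is the measurement sensitivity the abstract describes.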

Similar resources

Autotuning divide-and-conquer stencil computations

This paper explores autotuning strategies for serial divide-and-conquer stencil computations, comparing the efficacy of traditional “heuristic” autotuning with that of “pruned-exhaustive” autotuning. We present a pruned-exhaustive autotuner called Ztune that searches for optimal divide-and-conquer trees for stencil computations. Ztune uses three pruning properties — space-time equivalence, divi...

Divide and conquer the Hilbert space of translation-symmetric spin systems.

Iterative methods that operate with the full Hamiltonian matrix in the untrimmed Hilbert space of a finite system continue to be important tools for the study of one- and two-dimensional quantum spin models, in particular in the presence of frustration. To reach sensible system sizes such numerical calculations heavily depend on the use of symmetries. We describe a divide-and-conquer strategy f...

Parallelization of Divide-and-Conquer by Translation to Nested Loops

We propose a sequence of equational transformations and specializations which turns a divide-and-conquer skeleton in Haskell into a parallel loop nest in C. Our initial skeleton is often viewed as general divide-and-conquer. The specializations impose a balanced call tree, a fixed degree of the problem division, and elementwise operations. Our goal is to select parallel implementations of divide...

Transformation of Divide & Conquer to Nested Parallel Loops

We propose a sequence of equational transformations and specializations which turns a divide-and-conquer skeleton in Haskell into a parallel loop nest in C. Our initial skeleton is often viewed as general divide-and-conquer. The specializations impose a balanced call tree, a fixed degree of the problem division, and elementwise operations. Our goal is to select parallel implementations of divide-...

Optimizing Skeletal Stream Processing for Divide and Conquer

Algorithmic skeletons intend to simplify parallel programming by providing recurring forms of program structure as predefined components. We present a new distributed task parallel skeleton for a very general class of divide and conquer algorithms for MIMD machines with distributed memory. Our approach combines skeletal internal task parallelism with stream parallelism. This approach is compare...

Publication date: 2016
